Suicidal Tendencies and Ideation prediction using Reddit


2. Data Cleaning, Pre-processing and EDA

Now that we have obtained our data, we will take a first look at it, checking for missing values and choosing which parts of the data set will be useful for our classifier. We will also begin pre-processing the text data with natural language tools. This section concludes with some exploratory data analysis and visualisations.

2.1 Data Cleaning

Cutting down the dataset - As both sets have 100 columns, we should keep only the few columns that will be useful as predictors.

Concatenation - As we have already created an "is_suicide" column indicating which subreddit each post is from, we should concatenate both datasets.

Imputation - If there are missing values, we should find a way to impute the data.

Note: Choosing relevant columns

Title and Post - We felt that the text data in both the title and the post itself could serve our classifier well.

Author's handle and number of comments - The author's name and the number of comments are curve ball choices. There just might be some connection between a user's handle and his/her psyche. There might also be a signal in the number of comments a post receives.

URL - We left the URL in for reference, in case we want to look deeper into a particular post.
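
The three cleaning steps above could be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code. The column names "title", "selftext", "author", "num_comments" and "url" are assumptions based on Reddit's usual field names, the two tiny frames stand in for the scraped sets, and filling missing text with an empty string is only one possible imputation choice.

```python
import pandas as pd

# Toy stand-ins for the two scraped sets (the real ones have about 100 columns each).
depression = pd.DataFrame({
    "title": ["I feel empty"],
    "selftext": ["Nothing helps lately."],
    "author": ["user_a"],
    "num_comments": [3],
    "url": ["https://example.com/1"],
    "is_suicide": [0],
    "unused_column": ["x"],
})
suicide_watch = pd.DataFrame({
    "title": ["I want out"],
    "selftext": [None],  # a post with no body text
    "author": ["user_b"],
    "num_comments": [5],
    "url": ["https://example.com/2"],
    "is_suicide": [1],
    "unused_column": ["y"],
})

# Assumed column names; adjust to whatever the scraped data actually uses.
KEEP = ["title", "selftext", "author", "num_comments", "url", "is_suicide"]

# 1. Cut down each set and 2. stack them into one frame.
combined = pd.concat([depression[KEEP], suicide_watch[KEEP]], ignore_index=True)

# Missing values per column, checked before deciding how to impute.
print(combined.isnull().sum())

# 3. One simple imputation choice: treat a missing post body as empty text.
combined["selftext"] = combined["selftext"].fillna("")
```

Filling with an empty string keeps every row, so a title-only post can still contribute its title to the classifier.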

2.2 Pre-processing

As the posts are written by different people, they vary widely in form. To prepare the data for our classifier, we need to pre-process the posts.

Build processing functions - We will build a processing function that changes the text to lowercase, removes punctuation, and reduces related words to a common base word. With this function, we can create a separate column for our clean data.
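
A processing function of this kind might look like the sketch below. The names clean_text and strip_plural_s are hypothetical, and the base-word step is left as a pluggable callable because the original does not state which tool it uses. The toy reducer shown here only strips a trailing "s" and is not a real lemmatizer; in practice something like NLTK's WordNetLemmatizer().lemmatize could be passed in instead.

```python
import string

def clean_text(text, reduce_word=None):
    """Lowercase, strip punctuation, and optionally reduce each word to a base form."""
    # Lowercase and drop every ASCII punctuation character.
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    words = no_punct.split()
    # reduce_word is any callable mapping a word to its base word,
    # e.g. a lemmatizer's or stemmer's method; by default words are left unchanged.
    if reduce_word is not None:
        words = [reduce_word(word) for word in words]
    return " ".join(words)

def strip_plural_s(word):
    """Toy base-word reducer for illustration only, not a real lemmatizer."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

print(clean_text("Dark thoughts, again: my Nights feel LONG!", reduce_word=strip_plural_s))
```

With a DataFrame, the same function could be applied column by column, for example df["title_clean"] = df["title"].apply(clean_text).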

We have cleaned and pre-processed our text data.

We now have three possible columns to build our classifier on: "author_clean", "selftext_clean", "title_clean". Now, on to EDA.

2.3 Exploratory Data Analysis

Some areas to check out:

Top words - This is the obvious first step in our EDA: to see which words are used most in the titles, posts and usernames.

Significant Authors - This might not really affect the classifier that we are using, but it might be worth checking out users who post often and users who have posted on both subreddits.

Average length of posts - We have already noticed a significant number of r/SuicideWatch posts with no words. We should check how many words an average post contains in each subreddit.

2.3.1 Top Words (subreddit Posts)

We will visualise the most-used words in the respective subreddit posts in a word cloud and a barplot. We will first define a function that will help us with that. The function can be re-used for our titles and usernames later.
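
The counting half of such a function could be sketched as below. The name top_words is hypothetical and the plotting is left out. The (word, count) pairs it returns are the kind of input a bar plot or a word cloud would be drawn from. It assumes the text has already been cleaned, so splitting on whitespace is enough.

```python
from collections import Counter

def top_words(texts, n=20):
    """Return the n most frequent whitespace-separated words across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)

# Made-up cleaned posts for illustration.
posts = ["feel tired feel alone", "want sleep", "feel nothing"]
print(top_words(posts, n=2))
```

Because the function only needs an iterable of strings, it can be reused unchanged on the title and username columns.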

Reviewing "Top Words" in posts

Many similar words - We see a large number of shared words in our "top 20 words" from both subreddits, such as: wa, want, like, feel, life and people. This might make it difficult for our model to tell the two subreddits apart.

Unique words from r/depression - "depression"

Unique words from r/SuicideWatch - "anymore"
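
One way to find the words unique to each subreddit's top list is a simple set comparison, sketched below. The function name unique_top_words is hypothetical, and the two lists are shortened toy examples that only echo the words mentioned above; they are not the full top-20 results.

```python
def unique_top_words(top_a, top_b):
    """Words in top_a that do not appear in top_b, keeping top_a's order."""
    other = set(top_b)
    return [word for word in top_a if word not in other]

# Toy top-word lists, not the notebook's actual results.
top_depression = ["feel", "want", "like", "depression", "life"]
top_suicidewatch = ["want", "feel", "life", "anymore", "like"]

print(unique_top_words(top_depression, top_suicidewatch))
print(unique_top_words(top_suicidewatch, top_depression))
```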

2.3.2 Top Words (subreddit Titles)

We will visualise the most-used words in the respective subreddit titles in a word cloud and a barplot.

Reviewing "Top Words" in titles

Titles a better differentiator? - We still see a fair number of shared words in our "top 20 words" in our titles, but far fewer than in Posts. As Titles in the two subreddits seem to differ more than Posts, they might serve as a better place for our models to hunt for features.

Unique words from r/depression titles - "depression", "depressed", "hate"

Unique words from r/SuicideWatch titles - "die", "kill", "suicidal", "live", "year"

The difference between Wanting and Feeling - It is interesting to note that the top word in r/SuicideWatch is "want", and it is used more than twice as often as the word "feel". In r/depression, the top word is "feel", which is similarly used close to twice as often as "want". This trend is also reflected in our visualisations for posts.

2.3.3 Top Words (subreddit Usernames)

We will visualise the most-used words in the respective subreddit usernames in a word cloud and a barplot. Exploring usernames is an odd choice, as usernames are really short and it would be surprising to find something revealing in a sample pool of 1897 usernames. Let's try anyway!

Reviewing "Top Words" in Usernames

Throwaways Dominate - There are 67 accounts (out of 1897) with the word "throwaway" in their names. Throwaway accounts are temporary accounts used by users who want to maintain some anonymity. This is understandable given the subject matter of mental health.

Male-signifiers - Our top 20 list for depression is dominated by usernames with "mr", "man", "boy", "guy" in them. A check of our full list of author names revealed that there are more male-related names (68) than female ones (15). Our finding is not a strong link, but it is worth noting that there is a gender paradox in suicide studies: women are observed to have more suicidal thoughts, while men die by suicide more frequently.

Marijuana - The 420, or "four-twenty", cannabis code made its way into our top 20 for depression. It might be worthwhile to look for links to drug use in our posts.

Usefulness of Usernames? - Although our discoveries in Usernames are interesting, each of the "Top Words" occurs in only about 5 usernames on average. With so few occurrences, this might give our model some trouble. "Title" is still my top choice as the column to use to search for features.
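
Tallies like the "throwaway" and male-signifier counts above can be produced with a simple substring check, sketched below. The function name count_usernames_containing and the usernames are made up for illustration, and the original does not say exactly how its counts were obtained.

```python
def count_usernames_containing(usernames, terms):
    """Count usernames that contain at least one of the given substrings."""
    terms = [term.lower() for term in terms]
    return sum(any(term in name.lower() for term in terms) for name in usernames)

# Made-up usernames for illustration.
usernames = ["throwaway_8812", "MrBlueSky", "quietgirl", "ThrowAway420", "nightowl"]

print(count_usernames_containing(usernames, ["throwaway"]))

# Plain substring matching over-counts: "man" also matches "woman" or "human",
# so any tally built this way is a rough signal at best.
print(count_usernames_containing(usernames, ["mr", "man", "boy", "guy"]))
```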

2.3.4 Significant Authors

To understand the communities on these pages, we will check out users who post often and users who have posted on both subreddits.

Reviewing Significant Authors

The Moderator - The redditor SQLWitch seems to be a moderating presence in both subreddits, appearing a couple of times to remind redditors about rules like non-activism and not posting pro-suicide posts. SQLWitch's posts hint at the culture of both communities, which are similar help-seeking, problem-airing forums.

Double-Posting - Aside from SQLWitch, there are more than 26 other users who have posted on both forums. For example, u/thathumbletrashcan posted on r/depression on March 4th that "I don't want to die, but I don't want to live anymore". A day later, u/thathumbletrashcan visited the r/SuicideWatch forum and posted "I've finally given in. This is it I guess, I've finally grown the balls to fulfil my plan......all of you won't have to deal with me again." It seems that users treat r/SuicideWatch as a "late-stage" forum or a place to bid farewell to their fellow help-seekers. It might be useful to dig into the language used by these users prior to their "farewells", as the "shift" in their choice of words is exactly what we're looking for.
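
Both checks, frequent posters and authors present in both subreddits, reduce to counting and set intersection. The sketch below uses made-up (author, subreddit) pairs in place of the combined dataset; the variable names are hypothetical.

```python
from collections import Counter

# Made-up (author, subreddit) pairs standing in for the combined dataset.
posts = [
    ("user_a", "depression"),
    ("user_b", "SuicideWatch"),
    ("user_a", "SuicideWatch"),
    ("user_c", "depression"),
    ("user_a", "depression"),
]

# Users who post often: count posts per author.
post_counts = Counter(author for author, _ in posts)

# Users who posted on both subreddits: intersect the two author sets.
depression_authors = {a for a, sub in posts if sub == "depression"}
suicidewatch_authors = {a for a, sub in posts if sub == "SuicideWatch"}
both = depression_authors & suicidewatch_authors

print(post_counts.most_common(1))
print(sorted(both))
```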

2.3.5 Length of posts

We have already noticed a significant number of r/SuicideWatch posts with no words. We should check how many words an average post contains in each subreddit.

Reviewing Length of Posts

Longer r/depression posts - The average length of r/depression posts is almost 130 words longer than that of r/SuicideWatch. However, as we can see from our scatterplot above, these figures might be skewed by some extremely long posts and by the presence of empty posts in r/SuicideWatch.
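
A per-subreddit word count can be sketched as below, using made-up posts. Reporting the median next to the mean is a suggestion added here, not something the original does; it is one way to see how much a few very long or empty posts pull the average.

```python
from statistics import mean, median

def word_count(text):
    """Number of whitespace-separated words; an empty post counts as 0."""
    return len(text.split())

# Made-up posts keyed by subreddit; empty strings stand for posts with no body.
posts = {
    "depression": ["i feel tired all the time", "nothing helps"],
    "SuicideWatch": ["", "i want this to stop", ""],
}

for subreddit, texts in posts.items():
    lengths = [word_count(t) for t in texts]
    # The median is less sensitive than the mean to a few very long or empty posts.
    print(subreddit, "mean:", mean(lengths), "median:", median(lengths))
```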

2.3.6 Using Scattertext to visualise our corpus

As a final step in our EDA, we will use Scattertext to produce a user-friendly way of visualising our corpus in HTML.

NOTE: I've commented out the "to_csv" call in the next cell to prevent any accidental overwriting of the saved dataset.
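
An alternative to commenting the call out is to guard the write so that an existing file is never replaced. The sketch below is only an illustration of that idea: save_if_absent and the file name data_clean.csv are hypothetical, and a temporary directory is used so that running it touches no real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_if_absent(df, path):
    """Write df to CSV only when path does not exist yet; return True if written."""
    path = Path(path)
    if path.exists():
        return False
    df.to_csv(path, index=False)
    return True

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data_clean.csv"  # hypothetical file name
    first = save_if_absent(pd.DataFrame({"a": [1]}), target)
    second = save_if_absent(pd.DataFrame({"a": [2]}), target)  # skipped, file exists
    kept = pd.read_csv(target)["a"].tolist()

print(first, second, kept)
```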